This code implements the GPstruct learning algorithm.
The code has been tested with Python 2.7.8 and 3.4.1, Numpy 1.8.1, Numba 0.13.3, on Ubuntu 12.04.
There are basically two groups of code files.
Chain data: we used 4 tasks from the CRF++ dataset. The .x files contain sparse feature vectors for each position in the chain (i.e. each word); the .y files contain the reference labels for each word. The program itself therefore performs no feature extraction.
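As a rough illustration (this is a hypothetical in-memory view, not the on-disk format, which is parsed by prepare_from_data_chain), a chain datum could look like this:
import numpy as np

# Hypothetical representation of one chain (sentence) of length 3; the
# actual .x/.y parsing lives in prepare_from_data_chain and may differ.
x_chain = [np.array([3, 17, 42]),   # indices of active sparse features for word 1
           np.array([3, 99]),       # word 2
           np.array([7, 42, 101])]  # word 3
y_chain = np.array([0, 2, 2])       # one reference label per word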
learn_predict_gpstruct, around line 218, computation of f_star_mean: note that because of eq. 4 (arXiv preprint), the kernel matrices contain only the data for k_x, not for k_u (which consists of k_x blocks on the diagonal). Therefore, when e.g. multiplying with an f, you must act as if the kernel matrix were repeated along the diagonal, and multiply it against one block of f at a time (iterating over the labels y_t).
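A minimal sketch of this block-wise product (variable names are illustrative, not those used in learn_predict_gpstruct):
import numpy as np

# Compute k_u @ f without ever forming the block-diagonal k_u: since k_u
# consists of n_labels copies of k_x on its diagonal, multiply k_x against
# each per-label block of f separately.
n_points, n_labels = 4, 3
k_x = np.eye(n_points)                     # stand-in unary kernel matrix
f = np.random.randn(n_points * n_labels)   # latent values, one block per label

result = np.empty_like(f)
for y in range(n_labels):                  # iterate over labels y_t
    block = slice(y * n_points, (y + 1) * n_points)
    result[block] = k_x.dot(f[block])      # y-th diagonal block of k_u @ f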
Sampling from f* (as opposed to taking the mode/mean f*_MAP, as is done now) is not implemented yet; it does not seem to bring a performance improvement, cf. the arXiv paper.
Hyperparameter sampling is not implemented at this point either (work in progress).
The ICML paper requires splitting up training and prediction in order to do ensemble learning. That is not implemented yet. It requires being able to compute pseudo-likelihoods on grids where only some pixels are observed (i.e. there are occluded pixels); that part is implemented in the pseudo-likelihood (PL) code (cf. the visible_pixels argument).
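The idea behind the visible_pixels argument, in a minimal sketch (the function and variable names below are illustrative, not the actual PL code): when computing a grid pseudo-likelihood, only the conditional log-probabilities of observed pixels are summed.
import numpy as np

def pseudo_log_likelihood(log_cond, visible_pixels):
    # log_cond: (H, W) log p(y_ij | neighbours); visible_pixels: (H, W) boolean mask
    return log_cond[visible_pixels].sum()

log_cond = np.log(np.random.rand(4, 4))
visible = np.random.rand(4, 4) > 0.3       # True where the pixel is observed
print(pseudo_log_likelihood(log_cond, visible))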
hashable_compute_kStarTKInv_unary caches the computation of this kernel matrix so that it can be reused by a subsequent run on the same data, provided it is still in memory.
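The caching idea, as a minimal sketch (the real function hashes its actual inputs; the key and signature below are hypothetical):
import numpy as np

_kernel_cache = {}

def compute_kernel(data_key):
    # stand-in for the expensive kernel computation, memoized on a
    # hashable description of the data for as long as the process lives
    if data_key not in _kernel_cache:
        rng = np.random.RandomState(0)
        _kernel_cache[data_key] = rng.randn(100, 100)
    return _kernel_cache[data_key]

k1 = compute_kernel('japanesene:train:0-9')  # computed
k2 = compute_kernel('japanesene:train:0-9')  # returned from the in-memory cache
assert k1 is k2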
In [1]:
pygpstruct_location = '/home/sb358/pygpstruct_demo'
! rm -rf /home/sb358/pygpstruct_demo
! mkdir /home/sb358/pygpstruct_demo
# each ! line runs in its own subshell, so cd must be chained with && onto the same line
! cd /home/sb358/pygpstruct-master && git archive master | tar -x -C /home/sb358/pygpstruct_demo
In [2]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append(pygpstruct_location + '/src/') # replace by your path to .py files
In [3]:
import numpy as np
import prepare_from_data_chain
n_data=10
!rm -rf /tmp/pygpstruct_demo
prepare_from_data_chain.learn_predict_gpstruct_wrapper(data_indices_train = np.arange(n_data), # training data= first 10 data files
data_indices_test = np.arange(n_data, n_data*2), # test data= next 10 files
result_prefix='/tmp/pygpstruct_demo/', # where to store the result files
data_folder = pygpstruct_location + '/data/japanesene/',
n_samples=501, # how many MCMC iterations
task='japanesene',
prediction_thinning=100
)
The forwards-backwards algorithm, which computes the likelihood inside the MCMC training loop, was originally implemented in Numba. Because it is the speed bottleneck for training, I have reimplemented it in C as a Python module, which needs to be compiled separately:
In [4]:
! cd /home/sb358/pygpstruct-master/src/chain_forwards_backwards_native/; python setup.py install
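For reference, here is a minimal log-space sketch of the forwards pass that both the Numba and the C implementations compute (shapes and names are illustrative, not the actual module interface):
import numpy as np

def log_partition(unary, binary):
    # unary: (T, K) per-position log-potentials; binary: (K, K) transition log-potentials
    alpha = unary[0]                                 # log alpha at t = 0
    for t in range(1, unary.shape[0]):
        m = alpha[:, None] + binary                  # (K, K): previous label x current label
        alpha = unary[t] + np.logaddexp.reduce(m, axis=0)
    return np.logaddexp.reduce(alpha)                # log Z of the chain

unary = np.log(np.random.rand(5, 3))
binary = np.log(np.random.rand(3, 3))
print(log_partition(unary, binary))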
Then the native implementation can be used instead of the default Numba implementation by passing the native_implementation=True argument, which makes the run much faster:
In [5]:
n_data=10
!rm -rf /tmp/pygpstruct_demo
prepare_from_data_chain.learn_predict_gpstruct_wrapper(data_indices_train = np.arange(n_data), # training data= first 10 data files
data_indices_test = np.arange(n_data, n_data*2), # test data= next 10 files
result_prefix='/tmp/pygpstruct_demo/', # where to store the result files
data_folder = pygpstruct_location + '/data/japanesene/',
n_samples=501, # how many MCMC iterations
task='japanesene',
native_implementation = True,
prediction_thinning=100
)
In [6]:
import util
util.make_figure(np.arange(5), # display all 5 collected types of data
[('Japanese Named Entity Recognition task', '/tmp/pygpstruct_demo' + '/results.bin')], # can display several results on same plot by appending tuples to this list
bottom=None, top=None, save_pdf=False, max_display_length=2500)
The MCMC chain's state is saved at every MCMC step; this includes the state of the pseudo-random number generator, the last f, the current log likelihood, the marginals accumulated so far, etc.
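A sketch of the kind of state that gets checkpointed (the field names and the pickle format here are hypothetical, not the actual layout of the state files):
import pickle
import numpy as np

rng = np.random.RandomState(0)
state = {
    'rng_state': rng.get_state(),    # pseudo-random number generator state
    'f': np.zeros(10),               # last latent function values f
    'log_likelihood': -123.4,        # current log likelihood
    'marginals': np.zeros((10, 3)),  # marginals accumulated so far
}
with open('/tmp/pygpstruct_state_sketch.bin', 'wb') as fh:
    pickle.dump(state, fh)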
The MCMC chain can be interrupted between steps. However, there is no mechanism to prevent it from being stopped while it is writing the state files to disk, in which case the state would be lost.
When restarting with the same results directory, the program will check for the existence of saved state files. If it finds them, it will restart from the last saved state and continue the chain. It is the user's responsibility to ensure that the saved state corresponds exactly to the configuration of the experiment that is restarted. For instance, the training and test data should match, and the various other parameters passed to learn_predict_gpstruct should match as well.
For example, here we run the chain for 10 samples:
In [7]:
n_data=10
!rm -rf /tmp/pygpstruct_demo
prepare_from_data_chain.learn_predict_gpstruct_wrapper(data_indices_train = np.arange(n_data), # training data= first 10 data files
data_indices_test = np.arange(n_data, n_data*2), # test data= next 10 files
result_prefix='/tmp/pygpstruct_demo/', # where to store the result files
data_folder = pygpstruct_location + '/data/japanesene/',
n_samples=10, # how many MCMC iterations
task='japanesene',
prediction_thinning=1
)
... and now we can run the same chain for another 10 samples if we wish:
In [8]:
n_data=10
# !rm -rf /tmp/pygpstruct_demo # don't erase the results folder this time !
prepare_from_data_chain.learn_predict_gpstruct_wrapper(data_indices_train = np.arange(n_data), # training data= first 10 data files
data_indices_test = np.arange(n_data, n_data*2), # test data= next 10 files
result_prefix='/tmp/pygpstruct_demo/', # where to store the result files
data_folder = pygpstruct_location + '/data/japanesene/',
n_samples=20, # total number of MCMC iterations: the chain resumes at step 10 and runs to 20
task='japanesene',
prediction_thinning=1
)
!rm -rf /tmp/pygpstruct_demo # do clean up afterwards, though
Finally, here is a longer run on the basenp task, using the native implementation:
In [9]:
n_data=150
!rm -rf /tmp/pygpstruct_demo_long_run
prepare_from_data_chain.learn_predict_gpstruct_wrapper(data_indices_train = np.arange(n_data), # training data= first 150 data files
data_indices_test = np.arange(n_data, n_data*2), # test data= next 150 files
result_prefix='/tmp/pygpstruct_demo_long_run/', # where to store the result files
data_folder = pygpstruct_location + '/data/basenp/',
n_samples=1001, # how many MCMC iterations
task='basenp',
prediction_thinning=100,
native_implementation=True
)